Step 1: Load and Prepare
This tab is used to upload and prepare both input files. It gives an overview of data quality and lets the user filter and prepare the data for downstream expression analysis.
In the side panel, the user loads data and configures quality control options; the main panel (right) shows interactive visualizations.
Quality control and data cleansing
The user selects either LFQ (label-free quantification) or iBAQ (intensity-based absolute quantification) as the intensity metric for the subsequent differential expression analysis. If available, we suggest using LFQ intensities, as Eatomics was optimized for these. Internally, the intensity widget uses the selectProteinData function.
The exclude column widget allows the user to exclude samples, for example outliers found during initial quality analysis such as PCA. Selecting a sample here removes it from all subsequent analysis steps.
To avoid proteins with many missing values across the samples, the user selects the minimum number of samples in which a protein must have been detected. Internally, the filter widget uses the filterProteins function.
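The minimum-detection filter can be illustrated with a short Python sketch. This is not the filterProteins implementation (Eatomics itself is written in R); the function name `filter_by_min_detected`, the use of `None` for missing values, and the toy data are assumptions made for illustration.

```python
def filter_by_min_detected(intensities, min_samples):
    """Keep proteins detected (non-missing) in at least `min_samples` samples.

    `intensities` maps a protein name to a list of per-sample values;
    None marks a missing value (an assumption of this sketch).
    """
    kept = {}
    for protein, values in intensities.items():
        detected = sum(1 for v in values if v is not None)
        if detected >= min_samples:
            kept[protein] = values
    return kept


# Toy data: three hypothetical proteins measured in four samples.
toy = {
    "P1": [21.3, 20.9, None, 22.0],
    "P2": [None, None, None, 18.4],
    "P3": [19.0, 19.2, 19.1, 18.8],
}
filtered = filter_by_min_detected(toy, min_samples=3)
print(sorted(filtered))
```

Raising `min_samples` makes the filter stricter; with `min_samples=4` only proteins without any missing value remain.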
- Meaningful gene names: As gene names are easier to interpret than peptide identifiers, gene names are displayed primarily. As gene names can be non-unique, the user can choose to let Eatomics either
  - prepare unique IDs for duplicate gene names (make.unique() base R function), or
  - sum up multiple abundance values for one gene name (checkForIsoforms custom function). In the latter case, the user is informed about intensity shares.
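The two options for duplicate gene names can be sketched in Python as follows. The first function mimics the default suffixing of base R's make.unique() ("A", "A.1", "A.2") but, unlike the R function, does not guard against names that already carry such a suffix. The second is only a guess at what summing and reporting intensity shares could look like; it is not the checkForIsoforms code, and all function names and values are hypothetical.

```python
def make_unique(names):
    """Append .1, .2, ... to repeated names, keeping the first occurrence as is."""
    counts = {}
    result = []
    for name in names:
        if name not in counts:
            counts[name] = 0
            result.append(name)
        else:
            counts[name] += 1
            result.append(f"{name}.{counts[name]}")
    return result


def sum_duplicates(names, values):
    """Sum values per gene name and report each row's share of its gene total."""
    totals = {}
    for name, value in zip(names, values):
        totals[name] = totals.get(name, 0.0) + value
    shares = [value / totals[name] for name, value in zip(names, values)]
    return totals, shares


gene_names = ["A", "B", "A", "A"]
abundances = [6.0, 4.0, 3.0, 1.0]  # made-up values for a single sample

unique_ids = make_unique(gene_names)
totals, shares = sum_duplicates(gene_names, abundances)
print(unique_ids)
print(totals)
```

Unique IDs keep every row separate, while summing collapses rows into one value per gene; the shares show how much each original row contributed.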
Missing value imputation can be performed using kNN (k-nearest neighbours), MinDet, or QRILC from the imputeLCMD package, or a custom implementation of Perseus’ sampling from a Gaussian distribution (implemented by Matthias Ziehm).
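To give an idea of the Perseus-style approach, the Python sketch below replaces missing values in one sample with draws from a Gaussian that is shifted down and narrowed relative to the observed intensities. It is not the Eatomics implementation. The downshift of 1.8 and width of 0.3 standard deviations are the defaults commonly cited for Perseus; whether Eatomics uses the same values is an assumption, as are the function name and the toy data.

```python
import random
import statistics


def impute_from_shifted_gaussian(values, shift=1.8, width=0.3, seed=0):
    """Replace None entries with draws from N(mean - shift*sd, (width*sd)^2).

    Mean and sd are taken from the observed (non-missing) values of the
    sample, which are expected to be log-transformed intensities.
    """
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)  # needs at least two observed values
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        v if v is not None else rng.gauss(mu - shift * sd, width * sd)
        for v in values
    ]


sample = [25.0, 26.0, None, 24.0, None, 27.0]  # made-up log2 intensities
imputed = impute_from_shifted_gaussian(sample)
print(imputed)
```

The downshift reflects the assumption that values are missing mostly because they fall below the detection limit, so imputed values land in the low-intensity tail.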
Load the sample description/clinical data file
Select and load the clinical data input file (e.g., clinicaldata.txt), as specified above.
Main panel
Our Eatomics app wraps several functions to visualize the quality control of the uploaded proteinGroups.txt file.
Screenshots of the newly developed visualization plots are provided only in the documentation.
Principal component analysis
A common method of dimensionality reduction is principal component analysis (PCA). PCA calculates the axes of greatest variation (principal components) within the expression data. A common assumption is that a plot along these axes will separate the samples/patients into the groups under investigation. The user can choose which principal components to visualize and can color the samples by the uploaded sample/clinical characteristics.
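The idea of an axis of greatest variation can be made concrete with a small, dependency-free Python sketch that finds the first principal component by power iteration on the covariance matrix. This is for illustration only and is not how Eatomics computes its PCA; the function name and the toy data (six samples, two proteins) are assumptions.

```python
import math


def first_principal_component(samples, iterations=200):
    """Return (loadings, scores) of the first principal component.

    `samples` is a list of rows, one row of protein intensities per sample.
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Sample covariance matrix between proteins (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Two made-up groups of three samples that differ mainly in the first protein.
toy_samples = [
    [1.0, 5.0], [1.2, 5.1], [0.9, 4.9],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
]
loadings, scores = first_principal_component(toy_samples)
print(loadings)
print(scores)
```

In this toy case the scores of the two groups fall on opposite sides of zero, which is the kind of separation one hopes to see in the PCA plot.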
Distribution overview
The protein numbers plot shows the count of distinct proteins or isoforms per sample. The plot is generated by the plot_numbers() function from the DEP package, which was adjusted to work without experimental design information.
Protein coverage
Protein numbers describes the count of distinct protein groups per sample.
Sample-to-sample heatmap
The sample-to-sample heatmap describes the biological and/or technical variability of the samples. The user can choose to use Euclidean distance or Pearson correlation as a (dis-)similarity metric. Formed clusters should resemble the sample groups under investigation.
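Both metrics are easy to state in code. The Python sketch below computes Euclidean distance and Pearson correlation between intensity vectors and builds the pairwise matrix that a heatmap would display. The function names and toy data are assumptions; the sketch does not reproduce the clustering or plotting done by the app.

```python
import math


def euclidean(a, b):
    """Euclidean distance: 0 for identical samples, larger means less similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation: 1 for perfectly linearly related samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pairwise(samples, metric):
    """Apply `metric` to every ordered pair of samples."""
    names = list(samples)
    return {(p, q): metric(samples[p], samples[q]) for p in names for q in names}


# Made-up intensities of three proteins in three samples.
toy_samples = {
    "S1": [1.0, 2.0, 3.0],
    "S2": [2.0, 4.0, 6.0],
    "S3": [3.0, 2.0, 1.0],
}
distances = pairwise(toy_samples, euclidean)
correlations = pairwise(toy_samples, pearson)
print(correlations[("S1", "S2")], correlations[("S1", "S3")])
```

Note that the two metrics answer different questions: S1 and S2 are perfectly correlated although their Euclidean distance is not zero, because correlation ignores differences in overall scale.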
Cumulative protein intensities
Protein intensities are cumulated across all samples and plotted according to their relative abundance. Coloring marks the respective quantile of each protein. Highly abundant proteins, i.e., proteins ranked in the first quartile, are colored red and labeled. The top 20 ranked proteins and their cumulated intensities are given in the table to the right.
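One plausible reading of this plot, sketched in Python: sum each protein's intensity over all samples, rank proteins from most to least abundant, and track the running share of the total. The exact computation used by Eatomics is not described here, so the function name, the rank-based quartile assignment, and the toy data are all assumptions.

```python
def rank_cumulative(intensities):
    """Rank proteins by summed intensity and compute the cumulative share.

    Returns rows of (protein, summed intensity, cumulative fraction, quartile),
    where quartile 1 holds the top-ranked quarter of proteins.
    """
    totals = {protein: sum(values) for protein, values in intensities.items()}
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0.0
    for rank, (protein, total) in enumerate(ranked, start=1):
        running += total
        quartile = (4 * (rank - 1)) // len(ranked) + 1
        rows.append((protein, total, running / grand_total, quartile))
    return rows


# Made-up raw intensities of four proteins in two samples.
toy = {
    "A": [50.0, 50.0],
    "B": [30.0, 30.0],
    "C": [15.0, 15.0],
    "D": [5.0, 5.0],
}
rows = rank_cumulative(toy)
for row in rows:
    print(row)
```

A steep initial rise of the cumulative fraction indicates that a few highly abundant proteins dominate the total signal.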
Step 2: Differential expression
limma (linear models for microarray data) is a commonly used R/Bioconductor software package for analyzing microarray and RNA-seq data. Its functions fit linear models and conduct differential expression analysis.
Eatomics uses limma to perform real-time analysis of differentially expressed proteins between clinical parameters of choice. The resulting interactive visualizations, including volcano plots (detailed below), give a quick and detailed overview of the differential expression.
Step 3: Enrichment score calculation (ssGSEA)
This tab provides an easy-to-use interface to the established ssGSEA (single-sample Gene Set Enrichment Analysis) method, an extension of GSEA. The interactive widgets are based on the freely available code developed by the Broad Institute [1]. The tab currently supports four MSigDB (version 6.1) gene set collections for calculating the enrichment score: H (hallmark), C2 (KEGG), C5 (GO), and C1 (positional). As in conventional GSEA, the user selects one of these collections.
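The MSigDB collections are distributed as .gmt files, a tab-delimited format in which each line holds a gene set name, a description field, and then the member genes. The Python sketch below reads such lines into a dictionary. It is only meant to show the file structure and is not part of Eatomics; the function name, set names, genes, and URLs in the example are made up.

```python
def parse_gmt(lines):
    """Parse GMT lines into {gene set name: [gene symbols]}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Hypothetical content; real files contain many more sets and genes.
example_lines = [
    "EXAMPLE_SET_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3\n",
    "EXAMPLE_SET_B\thttp://example.org/b\tGENE2\tGENE4\n",
    "\n",
]
gene_sets = parse_gmt(example_lines)
print(gene_sets)
```

To read a real collection, the lines of an opened file such as h.all.v6.1.symbols.gmt can be passed to the same function.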
The resulting .gct files with the respective gene set enrichment scores are then downloaded automatically into the EnrichmentScore folder in the working directory.
A message alert pops up when the download is complete.
Screenshot of the ssGSEA widgets
Step 4: Differential enrichment
This tab facilitates differential analysis of the pathway enrichment scores calculated via ssGSEA (in the previous tab) for the clinical groups of interest. The enrichment scores of each gene set are tested for differential pathway expression using a t-test (t.test), which compares the two selected clinical conditions.
Visualization of differential ssGSEA
Volcano plot
The interactive volcano plot shows the log ratio (x-axis) versus the -log10 of the p-value calculated by the t-test. A Bonferroni-adjusted p-value threshold (adj. p < 0.05) is used to highlight the differentially expressed pathways.
The other two result formats, an interactive data table and a detailed description panel, list the highlighted up- and downregulated pathways and the details of the input parameters, respectively.
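How t-test results turn into volcano plot coordinates and highlights can be sketched in Python. The sketch starts from already computed log ratios and p-values, applies the Bonferroni adjustment (each p-value multiplied by the number of tests, capped at 1), and flags pathways with an adjusted p-value below 0.05. The function names, pathway names, and numbers are made up, and the app's own R code may differ in detail.

```python
import math


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def volcano_points(results, alpha=0.05):
    """Turn (pathway, log ratio, p-value) tuples into plot-ready records."""
    adjusted = bonferroni([p for _, _, p in results])
    points = []
    for (pathway, log_ratio, p), adj_p in zip(results, adjusted):
        points.append({
            "pathway": pathway,
            "x": log_ratio,           # effect size on the x-axis
            "y": -math.log10(p),      # unadjusted p-value on the y-axis
            "adj_p": adj_p,
            "highlight": adj_p < alpha,
        })
    return points


# Hypothetical test results for three pathways.
toy_results = [
    ("PATHWAY_A", 1.2, 0.001),
    ("PATHWAY_B", -0.8, 0.03),
    ("PATHWAY_C", 0.1, 0.6),
]
points = volcano_points(toy_results)
for point in points:
    print(point)
```

In this toy case PATHWAY_B has a raw p-value below 0.05 but is not highlighted, because its adjusted p-value (0.09) exceeds the threshold.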
4 References
1: Krug, K., et al., A Curated Resource for Phosphosite-specific Signature Analysis. Mol Cell Proteomics, 2019. 18(3): p. 576-593.
2: Ritchie, Matthew E., et al. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic acids research 43.7 (2015): e47-e47.
3: Lazar, C., “imputeLCMD: a collection of methods for left-censored missing data imputation.” R package, version 2 (2015).
5 GitLab folder layout
Shiny R package
  R
    |app.R
    |helpers.R
    |run_app_function.R
  vignettes
    |About.rmd
  man
    |helpers_roxygen.R
  markdown_docs
    |Tutorial.rmd
    |About.rmd
  input_files
    MSData
      |proteinGroups.txt
    ClinicalData
      |clindata.txt
  Datasets
    MSigDB
      |c1.all.v6.1.symbols.gmt
      |c2.cp.kegg.v6.1.symbols.gmt
      |c5.all.v6.1.symbols.gmt
      |h.all.v6.1.symbols.gmt
    TranscriptionFactors
      |GATA-familyHuman230318.txt
      |JUN-familyHuman230318.txt
      |SH2-familyHuman230318.txt
      |SOX-familyHuman230318.txt
      |TBX-familyHuman230318.txt
  EnrichmentScore
  Report
  |DESCRIPTION
  |NAMESPACE
  |README.md